ABSTRACT


Characterizing the nature of students’ affective and emotional states 
and detecting them is of fundamental importance in online course 
platforms. In this paper, we study this problem by using discussion 
forum posts derived from large open online courses. We find that 
posts identified as encoding confusion are actually manifestations 
of different learner affects pertaining to their informational needs— 
primarily seeking factual answers. We quantitatively demonstrate 
that the use of content-related linguistic features and community- 
related features derived from a post serve as reliable detectors of 
confusion while widely outperforming currently available algorithms 
of confusion detection. We also point out that several prediction 
tasks in this domain (e.g., confusion and urgency detection) can be 
correlated, and that a model trained for one task can effectively be 
used for making predictions on the other task without requiring la- 
beled examples. Finally, we highlight a very significant problem of 
adapting the classifier to unseen courses. 
1. INTRODUCTION


Discussion fora constitute a central feature of learner interaction in 
online course platforms, where learners post questions, opinions, 
and concerns, which are viewed, rated and answered by fellow- 
learners and/or teaching staff. In the particular instance of courses 
affording only virtual interactions, such as at-scale learning envi- 
ronments, forum posts constitute rich repositories of students’ af- 
fective and emotional states captured in real time. The focus of 
this study is on characterizing the nature of students’ affective and 
emotional states, manually identified as confusion in forum posts,
and developing automatic methods to detect them. Here, as in [25] 
and [2], we operationalize the definition of confusion as a state in 
which a student hits an impasse and is uncertain of how to move 
forward. As such, the reasons for confusion could be attributed to 
lack of clarity on the topic discussed or technical shortcomings of 
the learning interface, among others. Examples of such posts are 
shown in Table 1. 
Table 1: Posts representing confusion and its absence.
I have also problems with the section “Pre course Survey”
I have completed this section several times about 10, I have
the final message “Thanks” but at each new connection appears
in my courseware “pre course Survey (please complete)” Please
help me, what I have to do ? (Confusion)

Interesting! How often we say those things to others without 
really understanding what we are saying. That must have been a 
powerful experience! Excellent! (No confusion) 


The strong connection between learner affect, engagement, and
learning outcomes has long been understood, but studies of their
effect on continued participation in internet-based learning
environments such as MOOCs are only emerging (e.g., [25, 2]). In
addition to constituting supporting evidence to understand this
association, mechanisms to automatically detect learner affect encoded
via confusion in discussion fora serve the following ends. First,
they inform us about the aspects of a course that are frustrating for
learners and hence need improvement [24, 21, 11]. Second, they
can aid timely and accurate intervention for struggling learners by
providing critical insights into their emotional states [25],
eventually leading to the success of this critical course component.


For instance, when a student expresses confusion or misunderstand- 
ing about a concept, the immediacy with which the confusion is ad- 
dressed impacts student satisfaction and course progress. Because 
of this, and the demands of an at-scale learning environment, effi- 
cient and automatic detection of confusion has become more im- 
portant than ever before. With a steady increase in the number of 
courses on online course catalogs, and with limited means to con- 
trol the instructor-to-student ratio in online platforms, the problem 
of detecting confusion as expressed in online fora is timely. Despite 
the critical need, relatively few studies analyze confusion in course 
discussion forum posts [25, 2]. 


While the explicit purpose of discussion fora is to engage the users
in a way that develops a sense of community and communication
within large-scale online courses, the posts themselves serve as a
proxy for learner affect and emotions expressed in various forms.
Detecting this encoded affect from posts is an important challenge
for natural language processing algorithms. This is because, at the
outset, a post indicating confusion could be construed to be a
question. Since question posts and confusion posts (both forms of
information-seeking behavior) are remarkably similar, one would expect
that approaches to detect questions (e.g., [7]) ought to be directly
applicable. However, this is not always the case. Confusion posts
often lack an explicit question, making the two problems of question
detection and confusion detection closely related but not the same.
Detecting confusion in a post is non-trivial partly because, in posts
that do contain a question, the question tends to occur alongside
declarative sentences. A second difficulty is the variety of question
styles, including informal ones in which standard cues such as the
question mark are absent. Hence, simple heuristics of using the
question mark or 5W1H words (who, what, which, where, why, how) are
rendered inadequate.
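

To make the limitation concrete, the following is a minimal sketch of the kind of surface heuristic discussed above, applied to two posts quoted elsewhere in this paper. The function name and the sentence-splitting rule are our own illustrative assumptions, not part of any cited system.

```python
import re

# Cue words exactly as listed in the text above.
CUE_WORDS = {"who", "what", "which", "where", "why", "how"}


def looks_like_question(post: str) -> bool:
    """Naive heuristic: a question mark is present, or a sentence starts with a cue word."""
    if "?" in post:
        return True
    # Crude sentence split on terminal punctuation and newlines (an assumption).
    for sentence in re.split(r"[.!?\n]+", post):
        words = re.findall(r"[a-z']+", sentence.lower())
        if words and words[0] in CUE_WORDS:
            return True
    return False


# A post expressing confusion that contains no question mark and no leading cue word.
confused_post = (
    "I am trying to download 5.R.RData, but I cannot open it, "
    "can please let me know how I can open this file. With kind regards,"
)
# A post without confusion in which a sentence happens to start with 'How'.
not_confused_post = (
    "Interesting! How often we say those things to others without "
    "really understanding what we are saying. That must have been a "
    "powerful experience! Excellent!"
)

missed = not looks_like_question(confused_post)
false_alarm = looks_like_question(not_confused_post)
```

Under this heuristic the confused post is missed and the unconfused post is flagged, which illustrates why surface cues alone are insufficient.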


Additionally, as observed in [18], finding patterns to identify non- 
questions is more challenging than finding patterns in questions 
(since they usually do not share common lexical and/or syntactic 
patterns). This is directly applicable to confusion posts where posts 
not indicative of confusion have diverse intent. 


Prior studies in this direction (e.g., [6, 2, 25]) have led to the use 
of linguistic and structural features available from the discussion 
forum. While similar in spirit to these prior studies, this study sets 
itself apart from them in many ways. Firstly, we identify that confu- 
sion detection is different from simple/complex question detection. 
In order to solve this problem more effectively, we point out that the 
community needs a characterization of confusion instead of treat- 
ing it as yet-another text-classification task. We present an in-depth 
analysis of types of ‘confused posts’ using high-quality and reli- 
able manual annotations (Section 4). Motivated by this analysis, 
we then design features to detect confusion automatically in a su- 
pervised framework. We also point out that several prediction tasks 
in this domain (such as confusion and urgency detection) are corre- 
lated, and demonstrate that a model trained for one task can effec- 
tively be utilized for making predictions on the other task without 
requiring labeled examples. Finally, we highlight a very significant 
problem concerning the applicability of such classifiers to unseen 
courses. We summarize our contributions below: 


Characterizing affective states and informational needs: We ob- 
serve that nearly half of the posts encoding confusion and consid- 
ered urgent pertain to users seeking answers to factual questions. 
Aside from indicating an information need, these posts are also 
used to report course-specific issues such as concerns with assign- 
ments or quizzes as well as to report course-related technical issues 
(e.g., unavailability of a lecture video or a peer-assessment grade). 


Efficient confusion detection: We quantitatively demonstrate that 
our use of content-related linguistic features of a post and a set 
of community-related features associated with it serve as reliable 
detectors of confusion while widely outperforming currently avail- 
able algorithms of confusion detection. 


Combined confusion and urgency detection: We show that the 
trained confusion classifier also functions as an efficient urgency 
detector when tested on confusion posts also labeled as ‘Urgent’. 


Scaling the effort to other courses and domains: Based on the 
dataset, we make concrete suggestions to explore domain adapta- 
tion towards building course-generic classifiers. Rather than aim- 
ing for course-independent classifiers, our proposal is to harness the 
utility of available course-specific classifiers for an unseen course, 
based on suitably defined cross-domain similarities. 


By means of a thorough quantitative evaluation of our proposed
features in a supervised machine learning model, we demonstrate
its effectiveness as a scalable and efficient model for automatic
detection of confusion that generalizes well to unseen courses.


2. RELATED PRIOR WORK 


Confusion and its impact on learners: Studies modeling confusion
and exploring its relation to learner affect have found that
even though students seem to struggle when confused, the situation
leads them to attempt to resolve barriers to their understanding of
complex concepts [16, 10, 8]. However, it has also been pointed
out that remaining confused has a negative effect that leads to
student disengagement and eventual dropout, thus making it imperative
that confusion be resolved immediately [15, 25]. This necessity
is more immediate in the context of learning at scale given the
impersonal and distant nature of the learning process [14, 19].
Thus, detecting learner affect, particularly with respect to
understanding the material, has the potential to contribute to the
design of interventions which, as shown in prior studies (e.g., [9, 22]),
can lead to increased learning effectiveness in computer-based
learning environments such as online courses.


Detecting confusion: Focusing on MOOCs, where the only venue
for learner-instructor interaction is the discussion forum, studies
are now beginning to explore automated mechanisms to provide
timely learner support by analyzing forum content. These include
predicting when instructor intervention is needed [5, 6], monitoring
students’ opinions towards the course [20], recommending questions
to users for assisting students seeking answers [23], identifying
acceptable answers [13], organizing the forum content into aspects or
topics along with their sentiments to help instructors in promptly
addressing common issues [17], identifying posts that express
confusion to predict points of eventual student dropout [25], and
detecting posts that express confusion to then map confused posts to
course video clips as a way to automate interventions [2]. A common
feature of these approaches to detect confusion is their reliance
on textual and structural features of the discussion forums to design
effective algorithms.


While [25] uses a set of linguistic features to detect confusion, it
disregards the structural features (e.g., the number of times a post
has been read or the number of up-votes) that are found to be useful
in detecting the informational need or urgency [6]. In contrast, [2]
uses a set of structural features in combination with a linguistic
feature, and additionally relies on other dimensions of a post, such
as expression of a sentiment and the sense of urgency. This latter
reliance on the other dimensions is not realistic given the manual
effort of assigning the labels for sentiment and urgency (needed
to design corresponding classifiers). Our study shares similarities
with these prior studies in that we rely on the discussion forum
information, but differs from them by the use of a novel set of
features that encode both content-related and structural aspects of
the forum posts.


We compare the performance of our detection approach to that in
[2] and show that our approach outperforms the current state of the
art by a wide margin both in-domain and across course domains.
In addition, differing from prior work, we show that our confusion
classifier can simultaneously detect urgency, thereby addressing the
need for immediacy for learning effectiveness.


3. DATA DESCRIPTION 


The forum posts analyzed in this study are from the Stanford MOOC
Posts dataset, a corpus composed of 29,604 anonymized learner forum
posts from eleven Stanford University public online classes
[1]. The posts are taken from three course domains:
Humanities/Sciences, Medicine, and Education, with about 10,000 posts
in each set.


Table 2: Summary of posts from the three discussion forums

Category   | No. of Posts | Not Confused | Confused | Confused & Urgent (%) | No. of sentences per post (mean, sd)
Education  | 9878         | 6714         | 640      | 67.5                  | (6.6, 2.8)
Humanities | 9723         | 1358         | 2257     | 86.4                  | (4.5, 4.7)
Medicine   | 10001        | 1581         | 1598     | 38.9                  | (4.3, 3.7)


A salient feature of the dataset is that each post is available with 
manually assigned labels for six dimensions indicating confusion, 
urgency, question, opinion, answer, and sentiment. We encourage 
the readers to refer to [1] for more details. In our study, we only 
consider the dimensions of Confusion and Urgency: 


Confusion - encodes the extent to which the post expresses confusion,
on a scale of 1 (expert knowledge) to 7 (extreme confusion);


Urgency - denotes the extent to which the post is interpreted to
be urgent and requires that an instructor respond to the post, with 1
denoting ‘not urgent at all’ and 7 denoting ‘extremely urgent’.


We divide the posts into two groups—“confused” and “not con- 
fused” based on their gold Confusion scores. A score above 4 is 
considered a Confused post, whereas a score below 4 is regarded 
as a Not confused one (we disregard posts with score = 4 from 
the analyses). Likewise, an Urgency score above 4 is regarded as 
an Urgent post, whereas a score of 4 and below is regarded as a 
non-Urgent post. A summary of the data set is provided in Table 2. 
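

The thresholding rule above can be sketched as follows. This is only an illustration of the stated cut-offs; the function names and label strings are our own assumptions.

```python
from typing import Optional

NEUTRAL = 4.0  # midpoint of the 1-7 annotation scale


def confusion_label(score: float) -> Optional[str]:
    """Map a gold Confusion score to a class; a score of exactly 4 is discarded."""
    if score > NEUTRAL:
        return "confused"
    if score < NEUTRAL:
        return "not_confused"
    return None  # excluded from the analyses


def urgency_label(score: float) -> str:
    """Map a gold Urgency score to a class; a score of 4 counts as non-urgent."""
    return "urgent" if score > NEUTRAL else "not_urgent"
```

Note the asymmetry between the two dimensions: a Confusion score of 4 removes the post from the data, whereas an Urgency score of 4 keeps it in the non-urgent class.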


4. CHARACTERIZING CONFUSION 


To understand how confusion is expressed in forum posts, two of 
the authors independently coded a random sample of 200 posts 
from the entire data set for the following 6 types: 


1. Factual, if the post seeks clarification of a factual aspect of 
the course material, as in the post, “Does this mean logis- 
tic regression always gives adjusted ratios and the manually 
computed ratios are unadjusted?” 

2. Course-specific, if the user seeks a course-specific clarifi- 
cation, such as “Dear Staff, Can you give atleast 2 attempts 
for each quiz. Giving only one attempt is making us loose 
interest in the course. Kindly consider.” 

3. Course-technical, if the user seeks clarification on technical 
aspects of the course. For example, “I am trying to download 
5.R.RData, but I cannot open it, can please let me know how 
I can open this file. With kind regards,” 

4. Recommendation, if the user is seeking a recommendation. 
For instance, consider the following post. “another question 
would you use this form throughout the whole essay? or 
would you shorten it after using the full phrase?” 

5. Frustration, where the user expresses frustration, as in, “I 
had the same issue. Am I bad at finding the check button and 
bad at math???” 

6. Other, for posts that belong to none of the above 5 types. 


The inter-rater reliability, κ, was 0.81. Based on the instances
where both coders agreed, we characterize the type of posts. True
to the fact that the discussion forum is an avenue for learners to seek
learning support from fellow learners, the most popular post type is
Factual (54% of the annotated posts), where learners seek to clarify
their misunderstandings of concepts presented in the course. This
post type is then followed by Course-specific (27%) and
Course-technical (12%). The remaining posts were categorized as
Recommendation (3%), Frustration (2%) and Other (2%).
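

For readers unfamiliar with chance-corrected agreement, the sketch below computes Cohen's kappa for two coders. We assume here that the reported reliability statistic is Cohen's kappa, which the text does not state explicitly; the toy annotations are invented purely for illustration and are not the study's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both coders must label the same non-empty set of posts")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each coder's marginal label distribution.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented toy annotations (hypothetical, not from the dataset).
coder_a = ["Factual", "Factual", "Course-specific", "Frustration"]
coder_b = ["Factual", "Factual", "Course-specific", "Other"]
kappa = cohens_kappa(coder_a, coder_b)
```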


Overall, these observations confirm that posts indicative of confusion
need to be addressed in a timely manner; even though some
of them may not be explicit questions, they echo the
information-seeking nature and the uncertainty encoded in posts that are
explicit questions. Additionally, we hypothesize that the inherent
difference in the nature of affective states encoded as confusion could
be responsible for the inconclusive nature of the effect of confusion
on learning outcomes (e.g., confusion positively impacting learning
in [10] and negatively impacting outcomes in [25]).


5. DETECTING CONFUSION 


Our next focus is on building a confusion detector that will allow
for automatic identification of posts expressing confusion, to
facilitate immediate response and thereby enhance the learning
experience and reduce learner frustration. Towards this end, we use
confusion-detection features that can be grouped into two categories:
content-related and community-related features.


Content-related features: These features analyze the textual con- 
tent of the post: 


1. Automated readability index (ARI): Readability indices are 
designed to measure how understandable a piece of text is. 
We hypothesize that the posts encoding confusion, owing 
to their information seeking nature as well as owing to the 
tendency of learners to post verbatim course content, have 
higher readability indices (i.e., are more difficult to read) than 
those posts that do not encode confusion. 

2. Post length in words; 

3. Unigrams: These binary features encode whether a word oc- 
curred in the post or not. 

4. Topicality (LDA): These features use supervised Latent Dirich- 
let Allocation (LDA) [4] to generate the LDA labels as fea- 
tures. Towards this, we first perform a preprocessing step in- 
volving stop-word removal (including numbers and punctua- 
tion); stemming; and removing high-frequency (top 1%) and 
low-frequency words (occurring fewer than 5 times). Then 
a supervised LDA (sLDA) model is obtained with the con- 
fusion labels. Here we use the confusion labels for each 
post to obtain two sets of LDA words (associated with pres- 
ence/absence of confusion). This model predicts a label (con- 
fusion or not) based on the words in the post that occur in the 
respective LDA set. 

5. Question mark: Since confusion is often expressed via ques- 
tions, this feature checks for presence of a question mark. 
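

A minimal sketch of how four of the content-related features above could be extracted from a raw post is given below. It uses the standard ARI formula (4.71 × characters per word + 0.5 × words per sentence − 21.43); the tokenization and sentence-splitting rules are simplifying assumptions of ours, the unigrams are left unstemmed and unfiltered, and the sLDA feature is omitted.

```python
import re


def content_features(post: str) -> dict:
    """Extract ARI, length in words, question-mark flag, and binary unigrams."""
    words = re.findall(r"[A-Za-z0-9']+", post)
    # Crude sentence split on terminal punctuation (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", post) if s.strip()]
    n_words = len(words)
    n_sentences = max(len(sentences), 1)
    n_chars = sum(ch.isalnum() for ch in post)  # letters and digits only
    if n_words:
        ari = 4.71 * n_chars / n_words + 0.5 * n_words / n_sentences - 21.43
    else:
        ari = 0.0  # empty post: no meaningful readability score
    return {
        "ari": ari,
        "length": n_words,
        "has_question_mark": "?" in post,
        "unigrams": {w.lower() for w in words},  # presence/absence of each word
    }


features = content_features("Please help me, what I have to do ?")
```

Higher ARI values correspond to text that is harder to read, which is the direction hypothesized above for posts encoding confusion.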


Community-related features: A second set of predictors of whether
a post encodes confusion or not is obtained by observing how the
community of learners reacts to a post. In particular, a post that is
of general interest to learners (such as one that is seeking a factual
clarification, or that seeks resolution for a course-related technical
problem) would be read by several viewers, thus leading to a
relatively higher number of reads. Likewise, posts encoding confusion
are considered important, resulting in higher up-votes. Accordingly,
our set of features includes the number of (i) reads and (ii) up-votes
of the post.


Table 3: Performance of our approach and the two baselines. ‘NR’ stands for results that were not reported in the respective paper.

Course     | Model              | Accuracy | Precision | Recall | F-measure | Cohen’s Kappa
Humanities | Our Model          | 84.38    | 90.38     | 77.16  | 83.14     | 0.69
Humanities | Unigrams Model [3] | 71.99    | 71.00     | 82.21  | 75.28     | 0.44
Humanities | YouEDU [2]         | NR       | 77.80     | 64.20  | 70.00     | 0.62
Education  | Our Model          | 80.04    | 79.44     | 81.02  | 80.00     | 0.60
Education  | Unigrams Model [3] | 82.03    | 78.76     | 87.81  | 82.96     | 0.64
Education  | YouEDU [2]         | NR       | NR        | NR     | 38.30     | 0.36
Medicine   | Our Model          | 83.75    | 86.67     | 80.14  | 83.16     | 0.67
Medicine   | Unigrams Model [3] | 70.39    | 72.82     | 65.33  | 68.69     | 0.41
Medicine   | YouEDU [2]         | NR       | 69.90     | 58.90  | 62.70     | 0.56


We cast the task of confusion detection as one of binary
classification, where posts expressing confusion constitute the positive
class. For the purpose of this study we do not use the confusion
types identified in the characterization. We trained an Elastic-net
model, which is a regularization approach that uses a mixture of L1
and L2 penalties to perform variable selection [26].
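

As an illustration of what a mixture of L1 and L2 penalties means, the sketch below evaluates the elastic-net penalty under one common parametrization, alpha * (l1_ratio * ||w||_1 + (1 - l1_ratio) / 2 * ||w||_2^2). The parameter names alpha and l1_ratio, and the values used, are assumptions for illustration; the paper does not report its regularization settings.

```python
def elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5):
    """Elastic-net penalty on a weight vector.

    alpha    -- overall regularization strength (hypothetical value)
    l1_ratio -- mixing weight: 1.0 gives a pure L1 (lasso) penalty,
                0.0 gives a pure L2 (ridge) penalty
    """
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return alpha * (l1_ratio * l1_term + 0.5 * (1.0 - l1_ratio) * l2_term)


weights = [1.0, -2.0, 0.0]
mixed = elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5)
```

The L1 component drives some weights to exactly zero, which is what allows the model to perform variable selection over a large unigram vocabulary.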


6. EXPERIMENTS 


Datasets: From Table 2 we can see that for the majority of the courses,
the data is biased towards the negative (not-confusion) class. This
makes learning difficult, especially for the positive (confusion) class.
In order to alleviate this problem, for each course, we down-sample
the negative class (randomly) such that the two classes are
balanced. Additionally, forum posts from ‘Education’ contain very
few (640) confusion posts. This resulted in a very small resampled
dataset for this course (compared to the posts in Humanities and
Medicine) after down-sampling the negative class. Noting that this
dataset was prone to over-fitting due to very few posts as compared
to the number of features, we up-sampled the positive class to twice
its original size before down-sampling the negative class as before.
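

The resampling procedure can be sketched as follows. Up-sampling by simple duplication and the fixed random seed are our assumptions, since the text does not specify how the positive class was enlarged; the function name is hypothetical.

```python
import random


def balance_classes(positives, negatives, upsample_factor=1, seed=0):
    """Optionally up-sample the positive class, then randomly down-sample
    the negative class to the same size."""
    rng = random.Random(seed)
    # Up-sampling by duplication (an assumption; other schemes are possible).
    pos = list(positives) * upsample_factor
    neg = list(negatives)
    if len(neg) > len(pos):
        neg = rng.sample(neg, len(pos))  # sampling without replacement
    return pos, neg


# Class sizes for the Education course as given in Table 2
# (post identifiers here are placeholder integers).
education_pos = range(640)
education_neg = range(1000, 7714)
pos, neg = balance_classes(education_pos, education_neg, upsample_factor=2)
```

If the negative class is already no larger than the positive class, this sketch leaves it unchanged.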


We also tokenized the content of the posts; removed stopwords 
(175 unique words); stemmed [12]; and removed infrequent words 
(with count less than 5). The final vocabulary lists for these courses 
contained about 2400, 1400, and 1750 words respectively. 


Evaluation Measure: From the perspective of helping students, 
the positive (confusion) class, indicative of learner affect, is more 
important than the negative class. An ideal classifier would, there- 
fore, identify all confusion posts bringing them to the instructor’s 
attention (high recall for the positive class). Additionally, a high 
precision for the positive class is also important so that the instruc- 
tor’s efforts are not wasted in analyzing false-positives. Therefore, 
it seems natural to evaluate models using the F-measure of the pos- 
itive class (in-line with related prior work). For the sake of com- 
pleteness, we also report accuracy and Cohen’s Kappa. 
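

For concreteness, the positive-class measures described above can be computed as in the following sketch; the label strings and the toy predictions are hypothetical.

```python
def positive_class_scores(gold, predicted, positive="confused"):
    """Precision, recall, and F-measure of the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: two true positives, one false positive, one false negative.
gold = ["confused", "confused", "confused", "not_confused", "not_confused"]
pred = ["confused", "confused", "not_confused", "confused", "not_confused"]
precision, recall, f_measure = positive_class_scores(gold, pred)
```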


6.1 Confusion Detection 

Table 3 compares 10-fold CV results of our model with two prominent
baselines: (i) Unlike our model, our first baseline [2] uses
manual annotation for dimensions such as Opinion and Question
(apart from ground truth confusion labels for training). We include
their performance as reported in their paper. (ii) The second
baseline [3] uses only Unigram features. We replicated this baseline
in our experiments. Also, a random baseline would get a score of
50%. However, we do not include this result in the tables for clarity.


We can see that, for Humanities and Medicine, our model performs 
significantly better than the baselines. For instance, for the Hu- 
manities course, our model achieves 10.4% and 18.8% relative im- 
provements in F-measure over the two baselines. Similarly, on the 
Medicine course, our model achieves 21.1% and 32.3% relative 
improvements in F-measure. Our model’s Cohen’s Kappa (and ac- 
curacy when reported) are also better than the baselines. This in- 
dicates the utility of our features in not only learning the positive 
class, but also performing well on the overall classification task. 


For the Education course, our model outperforms the YouEDU [2]
model significantly. Our model achieves an F-measure of 80.0% as
opposed to only 38.3% by the YouEDU model. We would like to
remind the reader that the data for the Education course was
particularly skewed towards the negative class (not-confusion) with only
6.5% of the posts belonging to the positive class (confusion). This
stark difference in the performance of the two models emphasizes the
need for models that pay particular attention to the minority
class, which in this case is more significant than the majority class.


Interestingly, for this course, the performance of our model is com- 
parable to the unigrams model [3], with the latter performing slightly 
better. Both the models use the same dataset and so neither suffers 
from the rare-class problem. The seemingly disadvantageous na- 
ture of our features for this course is not consistent with the results 
obtained for the other two courses, and requires further investiga- 
tion. However, in general, the features proposed in our approach 
provide a considerable boost in performance. 


6.2 Effect of Degree of Confusion 

As mentioned in the data description, the dimension of Confusion
was annotated on a scale of 1-7 (denoting the degree of confusion),
which could be potentially construed to correspond to a scale of
affective states. While we had conflated all the positive confusion
levels (resp. negative levels) for the purpose of detection, here we
evaluate our detector at each degree of confusion. We examined the
performance (here, accuracy) at every Confusion degree and report the
results in Table 5. We observe that the accuracy generally increases
with confusion level and never decreases, suggesting the classifier’s
suitability for real applications (e.g., potentially informative to
instructional designers).


6.3 Feature ablation analysis 

Table 4 compares the predictive importance of our various features
by removing them one at a time. For convenience, the first row for
each course depicts the performance with the full feature set (same
as Table 3). From the table, ‘Unigram’ and ‘Question-mark’ seem
to be the most valuable. For instance, the model for Education
relies heavily on the Unigram features (removing which decreases
the F-measure from 80% to 64.7%). Removing any of the other
features like ‘Number of reads’ or ‘Post Length’ also hurts model
performance, albeit to a lower degree. Experiments reveal that the
inclusion of LDA as a feature hurts the model’s performance more than
it helps. Overall, we can conclude that removing most of our
features reduces the performance of the model to various degrees,
indicating their utility.


Proceedings of the 10th International Conference on Educational Data Mining 275


Table 4: Feature ablation. For each course, the top row corresponds to the complete feature set. The subsequent rows represent
performance with one of the features removed. Removing any feature (except ‘LDA’) decreases performance, indicating its utility.

Course     | Feature-class     | Removed Feature | Accuracy | Precision | Recall | F-measure | Cohen’s Kappa
Humanities | -                 | None            | 84.38    | 90.38     | 77.16  | 83.14     | 0.69
Humanities | Community-related | Number of Reads | 84.16    | 89.86     | 77.24  | 82.96     | 0.54
Humanities | Community-related | Score           | 84.24    | 89.86     | 77.40  | 83.06     | 0.55
Humanities | Content-related   | ARI             | 84.12    | 89.66     | 77.40  | 82.95     | 0.55
Humanities | Content-related   | Post Length     | 83.76    | 89.40     | 76.88  | 82.53     | 0.54
Humanities | Content-related   | Unigrams        | 80.73    | 88.44     | 70.82  | 78.53     | 0.24
Humanities | Content-related   | LDA             | 84.52    | 90.01     | 78.05  | 83.43     | 0.70
Humanities | Content-related   | Question Mark   | 70.91    | 72.64     | 73.03  | 72.00     | 0.53
Education  | -                 | None            | 80.04    | 79.44     | 81.02  | 80.00     | 0.60
Education  | Community-related | Number of Reads | 76.99    | 75.67     | 79.30  | 77.26     | 0.54
Education  | Community-related | Score           | 77.46    | 75.98     | 80.16  | 77.88     | 0.55
Education  | Content-related   | ARI             | 71.62    | 76.04     | 80.47  | 78.04     | 0.55
Education  | Content-related   | Post Length     | 76.88    | 75.42     | 79.53  | 77.25     | 0.54
Education  | Content-related   | Unigrams        | 62.15    | 60.67     | 69.77  | 64.70     | 0.24
Education  | Content-related   | LDA             | 85.16    | 83.33     | 87.97  | 85.44     | 0.70
Education  | Content-related   | Question Mark   | 76.64    | 73.86     | 82.58  | 77.88     | 0.53
Medicine   | -                 | None            | 83.75    | 86.67     | 80.14  | 83.16     | 0.67
Medicine   | Community-related | Number of Reads | 83.65    | 86.41     | 80.20  | 83.10     | 0.67
Medicine   | Community-related | Score           | 83.72    | 86.49     | 80.26  | 83.17     | 0.67
Medicine   | Content-related   | ARI             | 83.78    | 86.51     | 80.39  | 83.25     | 0.68
Medicine   | Content-related   | Post Length     | 83.72    | 86.59     | 80.14  | 83.13     | 0.67
Medicine   | Content-related   | Unigrams        | 80.43    | 86.65     | 72.93  | 78.91     | 0.61
Medicine   | Content-related   | LDA             | 83.62    | 86.06     | 80.64  | 83.14     | 0.67
Medicine   | Content-related   | Question Mark   | 70.04    | 73.90     | 62.49  | 67.55     | 0.40


Table 5: Accuracy of the model in detecting Confusion at different
levels. Numbers in () show number of instances. Performance
improves with increasing scores. Confusion at levels
higher than 5.5 did not have sufficient instances.

Course     | 4.5        | 5          | 5.5
Education  | 0.76 (521) | 0.80 (93)  | 0.87 (24)
Humanities | 0.69 (463) | 0.79 (553) | 0.79 (190)
Medicine   | 0.71 (641) | 0.86 (762) | 0.90 (154)
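

The leave-one-feature-out procedure behind Table 4 can be sketched as a generic loop. In the sketch below, the `evaluate` callback stands in for training and cross-validating the classifier on a given feature set; here it is replaced by a hypothetical stub that simply looks up the Medicine F-measures reported in Table 4, and the short feature names are our own labels.

```python
FEATURES = ["reads", "score", "ari", "length", "unigrams", "lda", "question_mark"]

# Medicine F-measures from Table 4, keyed by the removed feature
# (None = full feature set).
TABLE4_MEDICINE_F = {
    None: 83.16, "reads": 83.10, "score": 83.17, "ari": 83.25,
    "length": 83.13, "unigrams": 78.91, "lda": 83.14, "question_mark": 67.55,
}


def lookup_f_measure(active_features):
    """Stub evaluator: returns the reported score instead of training a model."""
    missing = [f for f in FEATURES if f not in active_features]
    return TABLE4_MEDICINE_F[missing[0] if missing else None]


def ablation(feature_names, evaluate):
    """Change in score when each feature is removed from the full set."""
    full_set = frozenset(feature_names)
    full_score = evaluate(full_set)
    return {name: evaluate(full_set - {name}) - full_score for name in feature_names}


deltas = ablation(FEATURES, lookup_f_measure)
most_valuable = min(deltas, key=deltas.get)  # largest drop when removed
```

A strongly negative change marks a feature the model depends on, while a change near zero or above zero suggests the feature contributes little on that course.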


6.4 Testing on Unseen Courses 

Our supervised model requires labeled training data. However,
considering the short duration of most online courses, manually
annotating an ongoing course is not only expensive but
also infeasible due to time and privacy constraints. Hence,
domain-independence of such classifiers is extremely desirable. In our next
experiment, we test a given model on an unseen course in order
to estimate the domain-independence of existing methods. Table 6
shows the results of this experiment. The last column of the table
shows the change in the model’s performance when tested on a course
not seen during training. We can see that the model performance
always decreases when it is tested on a new course. However, the
decrease can be expected to depend on the difference in the
class-conditional distributions of the train and the test sets. From this
perspective, one could argue that the posts from Humanities and
Medicine are more similar to each other than to the posts from
Education, as far as this task is concerned. For instance, when
a model trained on data from Humanities is tested on data from
Medicine, and vice-versa, the decrease in F-measure is only about
4 points. On the other hand, the model suffers a much greater
decrease in performance when it is trained on data from Medicine (or
Humanities) and is tested on data from Education, and vice-versa.


This result indicates that domain-adaptation methods that aim to
build course-independent classifiers should not blindly aim for
classifiers that perform well on all courses. Instead, a more
opportunistic alternative would be based on assessing the similarity
between the data from the source (training) and the target (testing)
courses.


6.5 Urgency Prediction 

In Table 2 we can see that there is a high correlation between the
‘Confused’ and ‘Urgent’ labelings. For instance, 86.4% of the
posts from Humanities labeled as ‘Confused’ are also labeled as
‘Urgent’. Therefore, it would be of interest to investigate how
well a model trained for detecting confusion would perform on the
task of detecting urgency. Table 7 shows the results of this
experiment. For this table we train our model using ground-truth
Confusion labeling, and use the trained model to make predictions on the
test instances. We then judge the model’s performance by comparing
the predicted positive/negative class with the ground-truth
Urgent/not-urgent class. Note that we use urgent/not-urgent labelings
only during evaluation and not training. As before, we are primarily
interested in the F-measure of the positive (urgent) class. From the table
we can see that we achieve a reasonably high F-measure, especially
for Humanities (75.78%) and Medicine (80.68%). This suggests
that for the two related tasks, classifiers trained for one task could
be used for the other task with little modification.


7. FUTURE DIRECTIONS 


We have presented a detailed analysis of posts indicative of confusion
from a collection of discussion forum posts from learners on
online courses spanning 3 domains. Our detailed manual analy- 
sis of the types of confusion posts suggests that subsequent explo- 
Table 6: Model performance decreases when tested on unseen courses. Performance drops indicate a need for more aggressive
domain-adaptation efforts on diverse pairs (like Education-Humanities), as compared to similar ones (Humanities-Medicine).

train-Course | test-Course | Acc.  | Precision | Recall | F-measure | Kappa | Change in F-measure
Humanities   | Humanities  | 84.38 | 90.38     | 77.16  | 83.14     | 0.69  | -
Education    | Humanities  | 70.25 | 67.86     | 76.95  | 72.12     | 0.40  | -11.02
Medicine     | Humanities  | 79.16 | 78.95     | 79.53  | 79.24     | 0.58  | -4.10
Education    | Education   | 80.04 | 79.44     | 81.02  | 80.00     | 0.60  | -
Humanities   | Education   | 71.88 | 81.60     | 56.48  | 66.76     | 0.44  | -13.24
Medicine     | Education   | 70.82 | 77.47     | 59.14  | 66.96     | 0.42  | -13.04
Medicine     | Medicine    | 83.75 | 86.67     | 80.14  | 83.16     | 0.67  | -
Humanities   | Medicine    | 81.06 | 87.03     | 73.00  | 79.40     | 0.62  | -3.76
Education    | Medicine    | 65.15 | 61.40     | 81.59  | 70.07     | 0.30  | -13.09